Manipulating excel data in R can be difficult because often excel users combine data storage, presentation and analysis in the same sheets, which can be heavly formattted (merged columns, multiple column headers, multiple tables per sheet). The Atlas of Variation is an example of this kind of messy ‘messy data’.
The objective is to extract CCG and other data to: + Put in PHOLIO format + New anlaysis
To do this we need to manipulate the data into tidy format.
Now new R tools for manipulating excel sheets - see https://www.r-bloggers.com/how-to-use-jailbreakr/ which we will use to reorganise the spreadsheets and prepare the data for analysis.
Start by loading libraries.
c("Loading required package: c", "Loading required package: hadley/xml2", "Loading required package: rsheets/linen", "Loading required package: rsheets/cellranger", "Loading required package: rsheets/rexcel", "Loading required package: rsheets/jailbreakr")
Failed with error: ‘'package' must be of length 1’
Skipping install of 'xml2' from a github remote, the SHA1 (2dbb567a) has not changed since last install.
Use `force = TRUE` to force installation
Skipping install of 'linen' from a github remote, the SHA1 (7618a13c) has not changed since last install.
Use `force = TRUE` to force installation
Skipping install of 'cellranger' from a github remote, the SHA1 (024d5ba3) has not changed since last install.
Use `force = TRUE` to force installation
Skipping install of 'rexcel' from a github remote, the SHA1 (e8dd5d30) has not changed since last install.
Use `force = TRUE` to force installation
Skipping install of 'jailbreakr' from a github remote, the SHA1 (2fbec5f6) has not changed since last install.
Use `force = TRUE` to force installation
and download the data from the Fingertips website.
trying URL 'https://fingertips.phe.org.uk/documents/DiagAtlasData_291116.xlsx'
Content type 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' length 9130983 bytes (8.7 MB)
==================================================
downloaded 8.7 MB
[1] "/Users/julianflowers/Documents/R_projects/timeSeries/aov"
Next we can look to see how many sheets are in the dataset - we’ll use rexcel_read_workbook to read in the whole workbook - this takes a little while. We can see there are 38 sheets. We have saved this to and R object we have called workbook.
aov_diag.xlsx [==-------------------------------------------------------] 1 / 38
aov_diag.xlsx [===------------------------------------------------------] 2 / 38
aov_diag.xlsx [====-----------------------------------------------------] 3 / 38
Let’s look at sheet 1 to see what the structure is. The jailbreakr package has a split_sheet function, which extracts tables from the sheet.
List of 4
$ :Classes 'worksheet_view', 'R6' <worksheet_view>
Public:
cells: active binding
clone: function (deep = FALSE)
data: NULL
dim: 2 2
header: NULL
idx: list
initialize: function (sheet, xr, filter, header, data)
lookup: active binding
lookup2: active binding
merged: active binding
sheet: worksheet, R6
table: function (col_names = TRUE, ...)
values: function ()
xr: cell_limits, list
$ :Classes 'worksheet_view', 'R6' <worksheet_view>
Public:
cells: active binding
clone: function (deep = FALSE)
data: NULL
dim: 1051 15
header: NULL
idx: list
initialize: function (sheet, xr, filter, header, data)
lookup: active binding
lookup2: active binding
merged: active binding
sheet: worksheet, R6
table: function (col_names = TRUE, ...)
values: function ()
xr: cell_limits, list
$ :Classes 'worksheet_view', 'R6' <worksheet_view>
Public:
cells: active binding
clone: function (deep = FALSE)
data: NULL
dim: 1050 15
header: NULL
idx: list
initialize: function (sheet, xr, filter, header, data)
lookup: active binding
lookup2: active binding
merged: active binding
sheet: worksheet, R6
table: function (col_names = TRUE, ...)
values: function ()
xr: cell_limits, list
$ :Classes 'worksheet_view', 'R6' <worksheet_view>
Public:
cells: active binding
clone: function (deep = FALSE)
data: NULL
dim: 1050 15
header: NULL
idx: list
initialize: function (sheet, xr, filter, header, data)
lookup: active binding
lookup2: active binding
merged: active binding
sheet: worksheet, R6
table: function (col_names = TRUE, ...)
values: function ()
xr: cell_limits, list
This identifies 4 subtables (look at dim) - one with 2 rows and 2 columns and three with 1050 rows and 15 columns. To extract the data values we need to do a bit more work. We can see the first table contains the title, and subsequent tables appear to have the same data structure but relate to different time periods. In the original spreadsheet these are separated by blank rows.
[1] "Map 1: Rate of computed axial tomography (CT) activity per weighted population by CCG"
Now we have a data frame for the first map which we can further analyse…
For example plotting data over time. We have quarterly and full year data - lets just plot the quarterly data.
This suggests that:
Now we have a method for extracting the data we can (try) to extract each table and combine the CCG data so we can look at the interrelationships between diagnostic tests. [I am doing this sheet by sheet but could be done programmatically although there are several differnt patterns].
After much effort we now have a data frame of the CCG data. There are several different sets of time periods. These will need to be recoded. We have quarterly, annual and three yearly data.